This is the title

Tony Liang

University of British Columbia

December 31, 2023

About myself

  • PhD student in Bioinformatics under Dr. Amrit Singh supervision

  • BSc. in Math + minor in Data Science

  • What does “bioinformatician” do?

    • Assist researchers like you to better understand what your data means
    • We’re just coders that know little bit more bio
  • Currently working in creating pipeline, tools, models analyzing biological data in an automated-fashion

    • Focused on machine learning & AI
    • Reproducible workflows

Single Cell RNA-seq

What is single cell RNA sequencing?

Loading data

The data is from a study (Kang et al. 2018) and publicly avaliable through R’s ExperimentHub function

eh <- ExperimentHub() #  Initialize the hub as some list object
sce <- eh[["EH2259"]] # We could then extract the match entry by taking this entry from out hub
# Then print it
sce
class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • After loaded data should inspect basic information
  • What are rows? column? size of data?
  • Rows = genes
  • Columns = cells

Preprocessing of data before analysis

The data retrieved is rawest form, not all of it is suitable for analysis.

Caveats:

  • Undetected genes
  • Cells with very few or many detected genes
  • Lowly expressed genes
  • unnormalized expression values

This is usually the quality control (QC) step. This could potentially be another tutorial, so not deeply covered today.

We only perfom simple actions

Remove undetected genes

sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce
class: SingleCellExperiment 
dim: 18890 29065 
metadata(0):
assays(1): counts
rownames(18890): RP11-34P13.8 AL627309.1 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • Reduced from \(35635\) genes to \(18890\), nearly 16K genes that were not detected in any cells

Remove cells with few or many detected genes

# We need additional support to compute per cell quality control metrics from scater package
qc <- scater::perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- scater::isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
# Then we remove those are consider as outlier
sce <- sce[, !ol] # This means retain column that are not ol
dim(sce)
[1] 18890 26820
  • Now we changed 29065 cells to 26820 cells, where these cells are either overly abundant or too few

Remove lowly expressed genes

# Similar to early, see pattern now with rowSums
# But we want to at least have 10 cells that high expression value, this threshold could change
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(0):
assays(1): counts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • Interesting now, we only retained 7118 genes from original 35K and 26820 cells from original 29K.
  • This would be our final “filtered data”, but not entirely ready for analysis
    • Need to normalize these expression values

Normalize expression values

Calculate a log2-transformed normalized expression values:

  • dividing each count by its size factor
  • adding pseudo count of 1
  • log transforming
# compute sum-factors & normalize
sce <- scater::computeLibraryFactors(sce)
sce <- scater::logNormCounts(sce)
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(0):
assays(2): counts logcounts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(6): ind stim ... multiplets sizeFactor
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

This is our finalized data that have gone through series of QC steps. Now, we move to muscat

MUSCAT

multi-sample multi-group scRNA-seq analysis tools (Crowell et al. 2020)

It expects a SCE and requires cell metadata columns to have:

  • sample_id : sample identifier i.e. Nautilus_trt_3
  • cluster id: subpopulation (cluster assignment) i.e. T cells, monocytes
  • group id: experimental group/condition i.e. control/treatment, healthy/diseased
sce$id <- paste0(sce$stim, sce$ind)
sce <- muscat::prepSCE(sce, 
    kid = "cell", # subpopulation assignments
    gid = "stim",  # group IDs (ctrl/stim)
    sid = "id",   # sample IDs (ctrl/stim.1234)
    drop = TRUE)  # drop all other colData columns
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(1): experiment_info
assays(2): counts logcounts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(3): cluster_id sample_id group_id
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

MUSCAT

Sample level analysis (pseudobulk)

Cell level analysis (Important, what we care, but limited)

  • Fit linear mixed model for each gene to cell level measurement data

Under the hood, these involves statistics modelling, mainly linear regression in high dimension setting.

Sample level

pb <- muscat::aggregateData(sce,
    assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
# Note it changed to 16 clusters, which is pre-defined earlier
pb
class: SingleCellExperiment 
dim: 7118 16 
metadata(2): experiment_info agg_pars
assays(8): B cells CD14+ Monocytes ... Megakaryocytes NK cells
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(16): ctrl101 ctrl1015 ... stim1256 stim1488
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Sample level – model fitting

ei <- metadata(sce)$experiment_info
set.seed(3)
dplyr::sample_n(ei, 3)
  sample_id group_id n_cells
1  ctrl1015     ctrl    2880
2  stim1488     stim    2779
3  stim1256     stim    2127
# This is actual model
mm <- model.matrix(~ 0 + ei$group_id)
dimnames(mm) <- list(ei$sample_id, levels(ei$group_id))
# construct design & contrast matrix
contrast <- limma::makeContrasts("stim-ctrl", levels = mm)
# run DS analysis
sample_res <- muscat::pbDS(pb, design = mm, contrast = contrast, verbose = F)
# We have a list of top tables for each cluster
tbl <- sample_res$table[[1]]
names(tbl)
[1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
[4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
[7] "Megakaryocytes"    "NK cells"         

Sample level – between cluster concordance

  • which genes are DE in only a single (or very few) clusters?
  • How many DE genes are shared between clusters?
# filter FDR < 5%, abs(logFC) > 1 & sort by adj. p-value
tbl_fil <- lapply(tbl, function(u) {
  u <- dplyr::filter(u, p_adj.loc < 0.05, abs(logFC) > 1)
  dplyr::arrange(u, p_adj.loc)
})

de_gs_by_k <- purrr::map(tbl_fil, "gene")
UpSetR::upset(UpSetR::fromList(de_gs_by_k))

Sample level – pseudobulk heatmap top N DS genes per cluster

# top-5 DS genes per cluster
muscat::pbHeatmap(sce, sample_res, top_n = 5)
  • Good when wanting to gain an overview of numerous DE testing results for many clusters

Sample level – pseudobulk heatmap top N DS genes for single cluster

# top-5 DS genes per cluster
muscat::pbHeatmap(sce, sample_res, k = "B cells")

Moreover, it provides a set of options regarding which cluster(s), gene(s), and comparison to include (arguments k, g and c, respectively).

Sample level – pseudobulk heatmap for single gene all clusters

# single gene across all clusters
muscat::pbHeatmap(sce, sample_res, g = "ISG20")
  • Also could visualize cluster-sample means of a single gene of interest across all clusters
    • Identify cell-types that are affected similarly by different experimental conditions

Cell level

scater::plotExpression(sce[, sce$cluster_id == "B cells"],
  features = tbl_fil$`B cells`$gene[seq_len(6)],
  x = "sample_id", colour_by = "group_id", ncol = 3) +
  guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  • For changes of high interest, we can view the cell-level expression profiles of a specific gene across samples or groups

Conclusion

  • scRNA-seq data is messy by nature — preprocessing is essential before any analysis.
  • We used scater for:
    • Removing undetected/low-quality genes & cells
    • Normalizing expression values
  • MUSCAT provides series of convenient functions to analyze scRNA-seq data
    • Although it focuses on bulk-level by aggregating them cells

Thanks!

  • Dr. Amrit Singh
  • Dr. Young Woong Kim
  • Dr. Maryam Ahmadzadeh
  • Rishika Daswani
  • Roy He
  • Michael Yoon
  • Jeffrey Tang
  • Akshdeep Sandhu
  • Yovindu Don
  • Raam Sivakumar
  • Prabhleen Sandhu
  • Mingming Zhang
  • Samuel Leung

Reference

Crowell, Helena L, Charlotte Soneson, Pierre-Luc Germain, Daniela Calini, Ludovic Collin, Catarina Raposo, Dheeraj Malhotra, and Mark D Robinson. 2020. “Muscat Detects Subpopulation-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell Transcriptomics Data.” Nature Communications 11 (1): 6077.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 1–21.
Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.